-
-
Notifications
You must be signed in to change notification settings - Fork 1.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Disable JetStream on disk errors #6292
Conversation
835c8bd
to
a309fcf
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should probably have better detection of a wider array of filesystem errors like permissions etc, agreed, but I'm not entirely sure why this would be a configurable option and I don't particularly like it either.
The disk being pulled out from underneath JetStream is effectively a catastrophic operational issue and for JS to continue to try to operate in those circumstances feels problematic, i.e. with the potential for data loss that no one may even notice. Even more so in a clustered scenario where the metalayer and Raft logs are being written out to disk.
Thank you for taking the time to review this. |
Reopening PR, as it was closed mistakenly |
Thank you for the suggestion @neilalexander. I have implemented the recommended change and removed the flag. |
Hello @neilalexander, Happy New Year. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think having the filestore.go
call DisableJetStream
directly may not be the best approach here. In a number of places we already surface errors up to stream.go
, i.e. for isOutOfSpaceErr
.
It feels like the more natural thing to do here would be to extend that same pattern, such that the stream examines surfaced errors and reacts, similar to how we do with the disk out-of-space warnings. Otherwise we end up with different types of disk errors being handled differently and that makes the code difficult to trace.
server/filestore.go
Outdated
fs.expireMsgsOnRecover() | ||
err := fs.expireMsgsOnRecover() | ||
if err != nil && err == errFileSystemPermissionDenied { | ||
fs.srv.Warnf("file system permission denied while expiring msgs, disabling jetstream: %v", err) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please make sure any log lines are capitalised/cased correctly.
server/filestore.go
Outdated
os.Remove(mb.mfn) | ||
err := os.Remove(mb.mfn) | ||
if err != nil && os.IsPermission(err){ | ||
return errFileSystemPermissionDenied |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there a reason to do this rather than just check os.IsPermission
further up the callstack?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
make sense, I will change it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It may even make sense to define a new error-detect function for this case, similar to how we already have this in store.go
for out-of-space:
func isOutOfSpaceErr(err error) bool {
return err != nil && (strings.Contains(err.Error(), "no space left"))
}
I agree; it makes sense to handle this within the same pattern. I have implemented the change as you suggested. |
9b4d96d
to
0caaf62
Compare
server/filestore_test.go
Outdated
@@ -8562,3 +8563,81 @@ func TestFileStoreDontSpamCompactWhenMostlyTombstones(t *testing.T) { | |||
fmb.bytes /= 2 | |||
require_True(t, fmb.shouldCompactInline()) | |||
} | |||
|
|||
func TestStoreRawMessageThrowsPermissionErrorIfFSModeReadOnly(t *testing.T) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
All tests in filestore_test.go
need to start TestFileStore
, otherwise CI won't pick them up properly.
server/filestore.go
Outdated
fs.writeFullState() | ||
err := fs.writeFullState() | ||
if isPermissionError(err) { | ||
fs.warn("File system permission denied when flushing stream state, disabling jetstream %v", err) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: Capitalisation JetStream
, pop a :
before the %v
too.
server/stream.go
Outdated
// messages in block cache could be lost in the worst case. | ||
// In the clustered mode it is very highly unlikely as a result of replication. | ||
mset.srv.DisableJetStream() | ||
mset.srv.Warnf("File system permission denied while writing msg, disabling jetstream: %v", err) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same nit here with capitalisation etc.
Filesystem permission denied while writing message, disabling JetStream: %v
Signed-off-by: Sourabh Agrawal <[email protected]>
Signed-off-by: Sourabh Agrawal <[email protected]>
491222c
to
0cde6ae
Compare
Thank you, @neilalexander. All the comments have been addressed. |
Overall looks OK, but CI is reporting It doesn't actually look like anything meaningful in that file has changed, so should just be able to do |
Hello, @neilalexander. I have made the suggested change, and CI checks are now passing. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
Currently, there are scenarios where NATS JetStream may encounter permission errors when file system goes into read only mode, which can lead to an inconsistent state. In such cases, the system continues to allow publishing messages by resetting stream state, leading to a misaligned consumer stream sequence. This PR introduces changes to gracefully handle these permission errors and prevent NATS from continuing in an inconsistent state when: - Flushing stream state to disk - Deleting expired messages on startup - Creating new block for messages After this PR, If NATS is running in non-clustered mode, the user will be unable to issue write requests until the issue is resolved. In clustered mode, only the affected node will stop accepting requests, while the system will continue to function as long as the required quorum remains healthy. PR potentially fixes : #6211 which leads to consumer sequence reaching higher than stream sequence. Signed-off-by: Sourabh Agrawal <[email protected]>
Currently, there are scenarios where NATS JetStream may encounter permission errors when file system goes into read only mode, which can lead to an inconsistent state. In such cases, the system continues to allow publishing messages by resetting stream state, leading to a misaligned consumer stream sequence.
This PR introduces changes to gracefully handle these permission errors and prevent NATS from continuing in an inconsistent state when:
After this PR, If NATS is running in non-clustered mode, the user will be unable to issue write requests until the issue is resolved. In clustered mode, only the affected node will stop accepting requests, while the system will continue to function as long as the required quorum remains healthy.
PR potentially fixes : #6211 which leads to consumer sequence reaching higher than stream sequence.
Signed-off-by: Sourabh Agrawal [email protected]